Supervised Learning Classification Project: AllLife Bank Personal Loan Campaign

Problem Statement

Context

AllLife Bank is a US bank with a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.

Data Dictionary

ID : Customer ID

Age : Customer's age in completed years

Experience : Number of years of professional experience

Income : Annual income of the customer (in thousand dollars)

ZIPCode : Home address ZIP code

Family : Family size of the customer

CCAvg : Average monthly credit card spending (in thousand dollars)

Education : Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional

Mortgage : Value of house mortgage, if any (in thousand dollars)

Personal_Loan : Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)

Securities_Account : Does the customer have a securities account with the bank? (0: No, 1: Yes)

CD_Account : Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)

Online : Does the customer use internet banking facilities? (0: No, 1: Yes)

CreditCard : Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries

import warnings

warnings.filterwarnings("ignore")

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

import seaborn as sns

pd.set_option("display.max_columns", None)

pd.set_option("display.max_rows", 200)

from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier

from sklearn import tree

from sklearn.model_selection import GridSearchCV

# had to replace plot_confusion_matrix with ConfusionMatrixDisplay

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
Loading the dataset

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Overview

Observations

Sanity checks

data.head() (the trailing CreditCard column was cut off in the export):

   ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage  Personal_Loan  Securities_Account  CD_Account  Online
0   1   25           1      49    91107       4    1.6          1         0              0                   1           0       0
1   2   45          19      34    90089       3    1.5          1         0              0                   1           0       0
2   3   39          15      11    94720       1    1.0          1         0              0                   0           0       0
3   4   35           9     100    94112       1    2.7          2         0              0                   0           0       0
4   5   35           8      45    91330       4    1.0          2         0              0                   0           0       0

data.tail() (the trailing Online and CreditCard columns were cut off in the export):

        ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage  Personal_Loan  Securities_Account  CD_Account
4995  4996   29           3      40    92697       1    1.9          3         0              0                   0           0
4996  4997   30           4      15    92037       4    0.4          1        85              0                   0           0
4997  4998   63          39      24    93023       2    0.3          3         0              0                   0           0
4998  4999   65          40      49    90034       3    0.5          2         0              0                   0           0
4999  5000   28           4      83    92612       3    0.8          1         0              0                   0           0

(5000, 14)


from google.colab import drive

drive.mount('/content/drive')

Loan = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/PGP_in_AI-ML_@UT_Austin-Docs/Module_2/Machine_Learni

data = Loan.copy()

data.head()

data.tail()

data.shape

data.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 5000 entries, 0 to 4999

Data columns (total 14 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 ID 5000 non-null int64

1 Age 5000 non-null int64

2 Experience 5000 non-null int64

3 Income 5000 non-null int64

4 ZIPCode 5000 non-null int64

5 Family 5000 non-null int64

6 CCAvg 5000 non-null float64

7 Education 5000 non-null int64

8 Mortgage 5000 non-null int64

9 Personal_Loan 5000 non-null int64

10 Securities_Account 5000 non-null int64

11 CD_Account 5000 non-null int64

12 Online 5000 non-null int64

13 CreditCard 5000 non-null int64

dtypes: float64(1), int64(13)

memory usage: 547.0 KB

data.describe().T output:

                     count          mean          std      min       25%      50%       75%      max
ID                  5000.0   2500.500000  1443.520003      1.0   1250.75   2500.5   3750.25   5000.0
Age                 5000.0     45.338400    11.463166     23.0     35.00     45.0     55.00     67.0
Experience          5000.0     20.104600    11.467954     -3.0     10.00     20.0     30.00     43.0
Income              5000.0     73.774200    46.033729      8.0     39.00     64.0     98.00    224.0
ZIPCode             5000.0  93169.257000  1759.455086  90005.0  91911.00  93437.0  94608.00  96651.0
Family              5000.0      2.396400     1.147663      1.0      1.00      2.0      3.00      4.0
CCAvg               5000.0      1.937938     1.747659      0.0      0.70      1.5      2.50     10.0
Education           5000.0      1.881000     0.839869      1.0      1.00      2.0      3.00      3.0
Mortgage            5000.0     56.498800   101.713802      0.0      0.00      0.0    101.00    635.0
Personal_Loan       5000.0      0.096000     0.294621      0.0      0.00      0.0      0.00      1.0
Securities_Account  5000.0      0.104400     0.305809      0.0      0.00      0.0      0.00      1.0
CD_Account          5000.0      0.060400     0.238250      0.0      0.00      0.0      0.00      1.0
Online              5000.0      0.596800     0.490589      0.0      0.00      1.0      1.00      1.0
CreditCard          5000.0      0.294000     0.455637      0.0      0.00      0.0      1.00      1.0

Exploratory Data Analysis

EDA is an important part of any project involving data. It is important to investigate and understand the data before building a model with it. A few questions are listed below to help approach the analysis and generate insights from the data; a thorough analysis, beyond these questions, should also be done.

Questions:

1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?

2. How many customers have credit cards?

3. What are the attributes that have a strong correlation with the target attribute (personal loan)?

4. How does a customer's interest in purchasing a loan vary with their age?

5. How does a customer's interest in purchasing a loan vary with their education?
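As a quick sketch of how questions like these can be answered straight from the dataframe, question 2 reduces to a `value_counts` call. The mini-frame below is a hypothetical stand-in for the loaded `data`; only the column name `CreditCard` is taken from the data dictionary.

```python
import pandas as pd

# Hypothetical 5-row stand-in for `data`; the real notebook uses the
# 5,000-row AllLife dataset with the same column name.
data = pd.DataFrame({"CreditCard": [0, 1, 1, 0, 1]})

# Question 2: how many customers hold a credit card from another bank?
cc_counts = data["CreditCard"].value_counts()
print(cc_counts[1], "of", len(data), "customers hold an external credit card")
```

On the full dataset the same call reproduces the 29.4% CreditCard mean seen in `describe()`.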

data.describe().T

data = data.drop(columns=['ID'])

# Creates a combined boxplot and histogram for a specified feature in a DataFrame:
# the boxplot displays the summary statistics, while the histogram represents the
# distribution of the data.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="xkcd:pale purple"
    )
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="green")
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")   # mean
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  # median

# labeled_barplot generates a bar plot for a specified column of a Pandas dataframe,
# showing either counts or percentages of the total. It uses a customized color palette,
# rotates the x-axis labels for better readability, sizes the figure dynamically based on
# the number of unique values displayed, and annotates each bar with its value.
def labeled_barplot(data, feature, perc=False, n=None):
    total = len(data[feature])
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )
    plt.show()

#This function generates a combined histogram and boxplot visualization for the "Age" column in the

#data frame, to facilitate the statistical analysis of the age distribution in the dataset.

histogram_boxplot(data, "Age")

#using Seaborn and Matplotlib libraries to create a vertical 2-row subplot:

#the first row contains a histogram with a KDE (Kernel Density Estimate) plot overlay representing

#the distribution of values in the 'Experience' column of the 'data' DataFrame, and the second row

#contains a boxplot of the same data to give insights into the statistical properties of the 'Experience'

#column (such as the median, interquartile range, etc.).

sns.set_style("darkgrid")

#combo

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(7, 10))

#this is the histogram code

sns.histplot(data['Experience'], bins=10, kde=True, ax=axes[0])

axes[0].set_title('Histogram of Experience')

axes[0].set_xlabel('Years of Experience')

axes[0].set_ylabel('Frequency')

#this is the boxplot code

sns.boxplot(x=data['Experience'], ax=axes[1])

axes[1].set_title('Boxplot of Experience')

axes[1].set_xlabel('Years of Experience')

plt.tight_layout()

plt.show()

#Same Experience data as above is being analyzed, but shown with the combined histogram_boxplot helper. This same helper will be used for the remaining numeric features.

histogram_boxplot(data, "Experience")

#This function generates a combined histogram and boxplot visualization for the "Income" column in the

#data frame, to facilitate the statistical analysis of the income distribution in the dataset.

histogram_boxplot(data, "Income")

#This function generates a combined histogram and boxplot visualization for the "CCAvg" column in the

#data frame, to facilitate the statistical analysis of the CCAvg distribution in the dataset.

histogram_boxplot(data, "CCAvg")

#This function generates a combined histogram and boxplot visualization for the "Mortgage" column in the

#data frame, to facilitate the statistical analysis of the Mortgage distribution in the dataset.

histogram_boxplot(data, "Mortgage")

#labeled_barplot visualizes the distribution of values in the "Family" column; perc=True

#instructs the function to display percentages of the total on the bars instead of counts.

labeled_barplot(data, "Family", perc=True)

#labeled_barplot visualizes the distribution of values in the "Education" column; perc=True

#instructs the function to display percentages of the total on the bars instead of counts.

labeled_barplot(data, "Education", perc=True)

#labeled_barplot visualizes the distribution of values in the "Securities_Account" column, with counts on the bars.

labeled_barplot(data, "Securities_Account")

#labeled_barplot visualizes the distribution of values in the "CD_Account" column, with counts on the bars.

labeled_barplot(data, "CD_Account")

#labeled_barplot visualizes the distribution of values in the "Online" column, with counts on the bars.

labeled_barplot(data, "Online")

#labeled_barplot visualizes the distribution of values in the "CreditCard" column, with counts on the bars.

labeled_barplot(data, "CreditCard")

#There are many distinct ZIP codes; plotting all of them here for visibility, and the top 10 in the next cell.
zip_counts = data['ZIPCode'].value_counts()
plt.figure(figsize=(10, 6))
barplot = sns.barplot(x=zip_counts.index, y=zip_counts.values, palette='viridis')
for p in barplot.patches:
    barplot.annotate(format(p.get_height(), '.0f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center',
                     xytext=(0, 9),
                     textcoords='offset points')
plt.title('Frequency of ZIP Codes')
plt.ylabel('Frequency')
plt.xlabel('ZIP Code')
plt.xticks(rotation=45)
plt.show()

top_10_zip_counts = data['ZIPCode'].value_counts().head(10)

#Using a barplot to visualize the top 10 most frequent ZIP codes in the top_10_zip_counts data.
#The bars represent different ZIP codes (on the X-axis) and their respective frequencies (on the Y-axis);
#each bar is annotated with its exact frequency value at the top center, and the X-axis labels
#are rotated by 45 degrees for readability.
plt.figure(figsize=(12, 7))
barplot = sns.barplot(x=top_10_zip_counts.index, y=top_10_zip_counts.values, palette='viridis')
for p in barplot.patches:
    barplot.annotate(format(p.get_height(), '.0f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center',
                     xytext=(0, 9),
                     textcoords='offset points')
plt.title('Top 10 Most Frequent ZIP Codes')
plt.ylabel('Frequency')
plt.xlabel('ZIP Code')
plt.xticks(rotation=45)
plt.show()

#stacked_barplot creates two cross-tabulations (contingency tables): one displaying the counts of each
#combination of the predictor and target variables (tab1) and the other displaying these counts as
#proportions (tab), which are then plotted as a stacked bar chart.
def stacked_barplot(data, predictor, target):
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # only one legend call is kept; a second call would override this placement
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

#distribution_plot_wrt_target generates a 2x2 grid of plots to visually analyze the distribution of a
#predictor variable with respect to the two unique target variable values. It uses histograms with KDE
#for the individual target values, and box plots (with and without outliers) to compare the predictor
#variable across target classes.
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()

#displays a heatmap of the pairwise correlations between the numeric columns of the data

plt.figure(figsize=(15, 7))

sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")

plt.show()

#This line of code calls the stacked_barplot function with "Education" as the predictor variable

#and "Personal_Loan" as the target variable to create a stacked bar plot visualizing the distribution

#of personal loan acceptance across different education levels using the data from the data dataframe.

stacked_barplot(data, "Education", "Personal_Loan")

#This creates and displays a stacked bar plot that illustrates the distribution of personal loans

#across different family sizes, using data grouped by the "Family" and "Personal_Loan" columns.

pivot_data = data.groupby(['Family', 'Personal_Loan']).size().unstack()

pivot_data.plot(kind='bar', stacked=True, figsize=(10, 7))

plt.title("Stacked Barplot of Personal Loan by Family")

plt.ylabel("Count")

plt.xlabel("Family Size/Type")

plt.show()

Personal_Loan          0    1   All
Securities_Account
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------

CD_Account        0    1   All
Personal_Loan
All            4698  302  5000
0              4358  162  4520
1               340  140   480
------------------------------------------------------------------------------------------------------------------------

#Creates a stacked bar plot visualizing the relationship between "Securities_Account" and "Personal_Loan".

stacked_barplot(data, "Securities_Account", "Personal_Loan")

#Creates a stacked bar plot visualizing the relationship between "CD_Account" and "Personal_Loan";
#note the argument order: here "Personal_Loan" is the predictor and "CD_Account" the target.

stacked_barplot(data, "Personal_Loan", "CD_Account")

Online            0     1   All
Personal_Loan
All            2016  2984  5000
0              1827  2693  4520
1               189   291   480
------------------------------------------------------------------------------------------------------------------------

CreditCard        0     1   All
Personal_Loan
All            3530  1470  5000
0              3193  1327  4520
1               337   143   480
------------------------------------------------------------------------------------------------------------------------

#Creates a stacked bar plot visualizing the relationship between "Online" and "Personal_Loan".

stacked_barplot(data, "Personal_Loan", "Online")

#Creates a stacked bar plot visualizing the relationship between "CreditCard" and "Personal_Loan".

stacked_barplot(data, "Personal_Loan", "CreditCard")

#TOO MANY ZIP CODES TO PLOT, USING TOP 10. This version of stacked_barplot takes a dataframe,
#two column names (target and category), and an optional parameter specifying the number of top
#categories to consider. It creates a stacked bar plot visualizing the distribution of the target
#column categories across the top N categories (plus an "Other" category for all the rest), and
#finally cleans up by removing the helper column it created for aggregation.
def stacked_barplot(data, target_col, category_col, top_n=10):
    # Get the top N categories
    top_categories = data[category_col].value_counts().head(top_n).index.tolist()
    data['agg_category'] = data[category_col].apply(lambda x: x if x in top_categories else "Other")
    pivot_data = data.groupby(['agg_category', target_col]).size().unstack()
    pivot_data.sort_values(by=1, ascending=False).plot(kind='bar', stacked=True, figsize=(12, 7))
    plt.title(f"Stacked Barplot of {target_col} by Top {top_n} {category_col}")
    plt.ylabel("Count")
    plt.xlabel(category_col)
    plt.xticks(rotation=45)
    plt.show()
    data.drop(columns='agg_category', inplace=True)

#Creates a stacked bar plot visualizing the relationship between "ZIPCode" and "Personal_Loan".

stacked_barplot(data, 'Personal_Loan', 'ZIPCode')

#Plots the distribution of ages for those with and without a personal loan, along with box plots.

distribution_plot_wrt_target(data, "Age", "Personal_Loan")

#distribution_plot_wrt_target is called with "Experience" as the predictor and "Personal_Loan" as
#the target, visualizing the distribution of "Experience" for each "Personal_Loan" class.

distribution_plot_wrt_target(data, "Experience", "Personal_Loan")

#distribution_plot_wrt_target is called with "Income" as the predictor and "Personal_Loan" as the
#target, visualizing the distribution of "Income" for each "Personal_Loan" class.

distribution_plot_wrt_target(data, "Income", "Personal_Loan")

#distribution_plot_wrt_target plots the distribution of "CCAvg" (average credit card spending)
#with respect to the "Personal_Loan" target, showing the relationship between credit card
#spending and personal loan status.

distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")

Data Preprocessing

Missing value treatment

Feature engineering (if needed)

Outlier detection and treatment (if needed)

Preparing data for modeling

Any other preprocessing steps (if needed)

Age 0.00

Experience 0.00

Income 1.92

ZIPCode 0.00

Family 0.00

CCAvg 6.48

Education 0.00

Mortgage 5.82

Personal_Loan 9.60

Securities_Account 10.44

CD_Account 6.04

Online 0.00

CreditCard 0.00

dtype: float64

#Detects outliers using the 1.5*IQR rule and reports the percentage of outliers per column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
(
    (data.select_dtypes(include=["float64", "int64"]) < lower)
    | (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
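The checklist above mentions outlier treatment "if needed". One common option, sketched here on a toy Series rather than the notebook's dataframe, is to cap values at the same IQR fences computed for detection; `clip` keeps every row but pulls extreme values back to the fence.

```python
import pandas as pd

# Toy column with one extreme value; the notebook's Mortgage column behaves similarly.
s = pd.Series([0, 0, 10, 20, 30, 500])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values at the fences instead of dropping rows
capped = s.clip(lower=lower, upper=upper)
print(capped.max() <= upper)  # → True
```

Whether capping is appropriate here is a judgment call; tree-based models in particular are largely insensitive to outliers, so leaving them untreated is also defensible.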

#creating a feature matrix "X" by dropping the "Personal_Loan" and "Experience" columns from the data, and a target vector "Y" from the "Personal_Loan" column

Shape of Training set : (3500, 477)

Shape of test set : (1500, 477)

Percentage of classes in training set:

0 0.905429

1 0.094571

Name: Personal_Loan, dtype: float64

Percentage of classes in test set:

0 0.900667

1 0.099333

Name: Personal_Loan, dtype: float64

Model Building

Model Evaluation Criterion

   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0

X = data.drop(["Personal_Loan", "Experience"], axis=1)

Y = data["Personal_Loan"]

#Applies one-hot encoding to the "ZIPCode" and "Education" columns of the data, dropping the first level of each to avoid redundant columns

X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state=1)
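With only ~9.6% positives, class balance in the two partitions is worth protecting. `stratify` is a standard `train_test_split` parameter that keeps the class ratio near-identical in both splits; the sketch below uses synthetic imbalanced labels as a stand-in for `Personal_Loan`, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (~10% positives), standing in for Personal_Loan
rng = np.random.RandomState(1)
X_toy = rng.rand(1000, 3)
y_toy = (rng.rand(1000) < 0.1).astype(int)

# stratify=y_toy preserves the class proportions in both partitions
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(ytr.mean(), yte.mean())  # class ratios match closely across splits
```

The unstratified split used above happened to land close (9.5% vs 9.9%), but stratifying removes the luck from that outcome.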

#This line prints the shapes of the training and test datasets and displays the percentage distribution of the

print("Shape of Training set : ", X_train.shape)

print("Shape of test set : ", X_test.shape)

print("Percentage of classes in training set:")

print(y_train.value_counts(normalize=True))

print("Percentage of classes in test set:")

print(y_test.value_counts(normalize=True))

#Defines a helper that computes metrics (accuracy, recall, precision, and F1 score) and plots a
#confusion matrix using seaborn's heatmap, visually representing the performance of the
#classification model on the data.

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

def model_performance_and_confusion_matrix(model, predictors, target):
    pred = model.predict(predictors)  # pred is computed once and used for both the metrics and the confusion matrix
    metrics = {
        "Accuracy": accuracy_score(target, pred),
        "Recall": recall_score(target, pred),
        "Precision": precision_score(target, pred),
        "F1": f1_score(target, pred),
    }
    df_perf = pd.DataFrame(metrics, index=[0])
    cm = confusion_matrix(target, pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    return df_perf  # returns the performance metrics DataFrame and plots the confusion matrix

#Initializes a Decision Tree classifier with the "gini" criterion and a fixed random state, and fits it on the training data

model = DecisionTreeClassifier(criterion="gini", random_state=1)

model.fit(X_train, y_train)

#This evaluates the model's performance using the training data, by calculating various metrics

#(accuracy, recall, precision, and F1 score) and visualizing the confusion matrix,

#then displaying it

model_performance_and_confusion_matrix(model, X_train, y_train)

DecisionTreeClassifier(random_state=1)

   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0

Visualizing Decision Tree

#Evaluates the decision tree model's performance on the training data and stores the performance metrics

decision_tree_perf_train = model_performance_and_confusion_matrix(model, X_train, y_train)

decision_tree_perf_train
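Perfect 1.0 scores on the training set are a classic sign of an overfit, fully-grown tree. `GridSearchCV` is already imported at the top of the notebook; here is a hedged sketch of how it could prune the tree via cross-validated hyperparameter search. The parameter grid, toy data, and recall scorer below are illustrative choices, not the notebook's; on the real data `X_train` / `y_train` would be passed to `fit`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

# Toy stand-in data; the notebook would use X_train / y_train instead.
rng = np.random.RandomState(1)
X_toy = rng.rand(300, 4)
y_toy = (rng.rand(300) < 0.1).astype(int)

# Illustrative pruning grid: shallower trees and larger leaves generalize better
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [5, 10, 20],
}
# Recall is a sensible scorer here: missing a likely loan buyer costs the campaign most.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring=make_scorer(recall_score),
    cv=3,
)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```

`grid.best_estimator_` is then a pruned tree whose train and test scores should sit much closer together than the 1.0-everywhere results above.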

#This line creates a list of feature names from the columns of the X_train dataframe, then prints it

feature_names = list(X_train.columns)

print(feature_names)

['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_90007', ..., 'ZIPCode_96651', 'Education_2', 'Education_3']

(full list truncated: 9 base features, the one-hot ZIPCode dummies, and 2 Education dummies make up the 477 feature columns seen in X_train's shape)

#Plot the fitted tree with filled nodes and a fixed font size; the loop at the
#bottom adds black edges to the arrows, and the tree is then displayed
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

#Print a text representation of the decision tree model, including the names of
#the features used at each split and the class weights at each leaf
print(tree.export_text(model, feature_names=feature_names, show_weights=True))

|--- Income <= 116.50
| |--- CCAvg <= 2.95
| | |--- Income <= 106.50

| | | |--- weights: [2553.00, 0.00] class: 0

| | |--- Income > 106.50

| | | |--- Family <= 3.50

| | | | |--- ZIPCode_90049 <= 0.50

| | | | | |--- ZIPCode_92007 <= 0.50

| | | | | | |--- ZIPCode_93106 <= 0.50

| | | | | | | |--- weights: [63.00, 0.00] class: 0

| | | | | | |--- ZIPCode_93106 > 0.50

| | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | |--- ZIPCode_92007 > 0.50

| | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | |--- ZIPCode_90049 > 0.50

| | | | | |--- weights: [0.00, 1.00] class: 1

| | | |--- Family > 3.50

| | | | |--- Age <= 32.50

| | | | | |--- CCAvg <= 2.40

| | | | | | |--- weights: [12.00, 0.00] class: 0

| | | | | |--- CCAvg > 2.40

| | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | |--- Age > 32.50

| | | | | |--- Age <= 60.00

| | | | | | |--- weights: [0.00, 6.00] class: 1

| | | | | |--- Age > 60.00

| | | | | | |--- weights: [4.00, 0.00] class: 0

| |--- CCAvg > 2.95

| | |--- Income <= 92.50

| | | |--- CD_Account <= 0.50

| | | | |--- ZIPCode_91360 <= 0.50

| | | | | |--- ZIPCode_92220 <= 0.50

| | | | | | |--- ZIPCode_94709 <= 0.50

| | | | | | | |--- ZIPCode_92521 <= 0.50

| | | | | | | | |--- ZIPCode_91203 <= 0.50

| | | | | | | | | |--- ZIPCode_94122 <= 0.50

| | | | | | | | | | |--- ZIPCode_94105 <= 0.50

| | | | | | | | | | | |--- truncated branch of depth 5

| | | | | | | | | | |--- ZIPCode_94105 > 0.50

| | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | | | | | |--- ZIPCode_94122 > 0.50

| | | | | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | | | | |--- ZIPCode_91203 > 0.50

| | | | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | | | |--- ZIPCode_92521 > 0.50

| | | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | | |--- ZIPCode_94709 > 0.50

| | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | |--- ZIPCode_92220 > 0.50

| | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | |--- ZIPCode_91360 > 0.50

| | | | | |--- weights: [0.00, 1.00] class: 1

| | | |--- CD_Account > 0.50

| | | | |--- weights: [0.00, 5.00] class: 1

| | |--- Income > 92.50

| | | |--- Family <= 2.50

| | | | |--- Education_2 <= 0.50

| | | | | |--- Education_3 <= 0.50

| | | | | | |--- CD_Account <= 0.50

| | | | | | | |--- ZIPCode_90034 <= 0.50

| | | | | | | | |--- weights: [28.00, 0.00] class: 0

| | | | | | | |--- ZIPCode_90034 > 0.50

| | | | | | | | |--- Income <= 103.50

| | | | | | | | | |--- weights: [0.00, 1.00] class: 1

| | | | | | | | |--- Income > 103.50

| | | | | | | | | |--- weights: [1.00, 0.00] class: 0

| | | | | | |--- CD_Account > 0.50

| | | | | | | |--- CCAvg <= 4.75

| | | | | | | | |--- weights: [0.00, 2.00] class: 1

| | | | | | | |--- CCAvg > 4.75

| | | | | | | | |--- weights: [1.00, 0.00] class: 0

| | | | | |--- Education_3 > 0.50

| | | | | | |--- CCAvg <= 3.95

| | | | | | | |--- ZIPCode_90277 <= 0.50

| | | | | | | | |--- weights: [0.00, 5.00] class: 1

| | | | | | | |--- ZIPCode_90277 > 0.50

| | | | | | | | |--- weights: [1.00, 0.00] class: 0

| | | | | | |--- CCAvg > 3.95

| | | | | | | |--- Income <= 107.00

| | | | | | | | |--- weights: [6.00, 0.00] class: 0

| | | | | | | |--- Income > 107.00

| | | | | | | | |--- weights: [0.00, 2.00] class: 1

| | | | |--- Education_2 > 0.50

| | | | | |--- weights: [0.00, 4.00] class: 1

| | | |--- Family > 2.50

| | | | |--- Age <= 57.50

| | | | | |--- ZIPCode_90245 <= 0.50

| | | | | | |--- weights: [0.00, 20.00] class: 1

| | | | | |--- ZIPCode_90245 > 0.50

| | | | | | |--- weights: [1.00, 0.00] class: 0

| | | | |--- Age > 57.50

| | | | | |--- Income <= 97.50

| | | | | | |--- weights: [0.00, 2.00] class: 1

| | | | | |--- Income > 97.50

| | | | | | |--- ZIPCode_94606 <= 0.50

| | | | | | | |--- weights: [7.00, 0.00] class: 0

| | | | | | |--- ZIPCode_94606 > 0.50

| | | | | | | |--- weights: [0.00, 1.00] class: 1

|--- Income > 116.50

| |--- Family <= 2.50

| | |--- Education_3 <= 0.50

| | | |--- Education_2 <= 0.50

| | | | |--- weights: [375.00, 0.00] class: 0

| | | |--- Education_2 > 0.50

| | | | |--- weights: [0.00, 53.00] class: 1

| | |--- Education_3 > 0.50

| | | |--- weights: [0.00, 62.00] class: 1

| |--- Family > 2.50

| | |--- weights: [0.00, 154.00] class: 1

#Display the importance score of each feature used in the decision tree model,
#sorted in descending order of importance
print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

                    Imp
Income         0.308577
Family         0.246862
Education_2    0.165238
Education_3    0.144207
CCAvg          0.048662
...                 ...
ZIPCode_92110  0.000000
ZIPCode_92109  0.000000
ZIPCode_92106  0.000000
ZIPCode_92104  0.000000
ZIPCode_93009  0.000000

[477 rows x 1 columns]

#Create a horizontal bar chart visualizing the relative importance of each
#feature used in the decision tree model
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking model performance on test data

#Evaluate the trained model on the test dataset and display the confusion
#matrix and performance metrics: accuracy, recall, precision, and F1 score
model_performance_and_confusion_matrix(model, X_test, y_test)

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084

#Store the test-set performance metrics and confusion matrix of the decision
#tree model for later comparison, then display them
decision_tree_perf_test = model_performance_and_confusion_matrix(model, X_test, y_test)
decision_tree_perf_test

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084
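The helper `model_performance_and_confusion_matrix` is defined earlier in the notebook and is not shown in this section. A minimal sketch of a compatible helper, assuming it prints the confusion matrix and returns the four metrics as a one-row DataFrame (matching the tables above), could look like:

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

def model_performance_and_confusion_matrix(model, X, y):
    """Score a fitted classifier and return its metrics as a one-row DataFrame."""
    pred = model.predict(X)
    print(confusion_matrix(y, pred))  # rows: actual class, columns: predicted class
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(y, pred),
            "Recall": recall_score(y, pred),
            "Precision": precision_score(y, pred),
            "F1": f1_score(y, pred),
        },
        index=[0],
    )
```

This is only a sketch of what such a helper might contain; the notebook's actual version may also draw the confusion matrix as a heatmap.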

Pre-pruning

#Set up a grid search with 5-fold cross-validation to find the best
#hyperparameters for a Decision Tree classifier, scoring on recall; then fit
#the classifier with the best parameters to the training data
estimator = DecisionTreeClassifier(random_state=1)
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}
acc_scorer = make_scorer(recall_score)
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
estimator = grid_obj.best_estimator_
estimator.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=10,
                       random_state=1)

#Check the performance on the training data
model_performance_and_confusion_matrix(model, X_train, y_train)

   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0

#Check performance on the training data and store it for later comparison
decision_tree_tune_perf_train = model_performance_and_confusion_matrix(model, X_train, y_train)
decision_tree_tune_perf_train

   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0

Visualizing the Decision Tree

#Plot the decision tree of the best estimator found in the grid search; the
#loop at the bottom adds black edges to the arrows
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

#Print the structure of the optimized decision tree model in text format,
#including the feature names and node weights
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))

|--- Income <= 116.50
| |--- CCAvg <= 2.95
| | |--- Income <= 106.50
| | | |--- weights: [2553.00, 0.00] class: 0
| | |--- Income > 106.50
| | | |--- weights: [79.00, 10.00] class: 0
| |--- CCAvg > 2.95
| | |--- Income <= 92.50
| | | |--- weights: [117.00, 15.00] class: 0
| | |--- Income > 92.50
| | | |--- Family <= 2.50
| | | | |--- weights: [37.00, 14.00] class: 0
| | | |--- Family > 2.50
| | | | |--- Age <= 57.50
| | | | | |--- weights: [1.00, 20.00] class: 1
| | | | |--- Age > 57.50
| | | | | |--- weights: [7.00, 3.00] class: 0
|--- Income > 116.50
| |--- Family <= 2.50
| | |--- Education_3 <= 0.50
| | | |--- Education_2 <= 0.50
| | | | |--- weights: [375.00, 0.00] class: 0
| | | |--- Education_2 > 0.50
| | | | |--- weights: [0.00, 53.00] class: 1
| | |--- Education_3 > 0.50
| | | |--- weights: [0.00, 62.00] class: 1
| |--- Family > 2.50
| | |--- weights: [0.00, 154.00] class: 1

#Display the feature importances of the optimized decision tree model as a
#DataFrame, sorted in descending order of importance
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

                    Imp
Income         0.337681
Family         0.275581
Education_2    0.175687
Education_3    0.157286
CCAvg          0.042856
...                 ...
ZIPCode_92103  0.000000
ZIPCode_92101  0.000000
ZIPCode_92096  0.000000
ZIPCode_92093  0.000000
ZIPCode_93009  0.000000

[477 rows x 1 columns]

#Plot a horizontal bar chart to visualize the relative importance of each
#feature used in the optimized decision tree
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

#Check performance on test data
model_performance_and_confusion_matrix(model, X_test, y_test)

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084

#Check performance on test data and store it for later comparison
decision_tree_tune_post_test = model_performance_and_confusion_matrix(model, X_test, y_test)
decision_tree_tune_post_test

   Accuracy    Recall  Precision        F1
0     0.984  0.879195   0.956204  0.916084

#Use the cost complexity pruning path method to determine the effective alphas
#and the corresponding total impurities at each step for a decision tree
#classifier trained on the given training data
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

#Create a DataFrame from the "path" variable, containing the alpha values
#and impurities from the cost-complexity pruning path of the decision tree
pd.DataFrame(path)

    ccp_alphas  impurities
0     0.000000    0.000000
1     0.000276    0.000552
2     0.000279    0.002224
3     0.000381    0.002605
4     0.000476    0.003081
5     0.000500    0.003581
6     0.000513    0.007174
7     0.000527    0.007701
8     0.000544    0.008246
9     0.000545    0.009882
10    0.000625    0.010507
11    0.000700    0.011207
12    0.000762    0.012731
13    0.000882    0.016260
14    0.000940    0.017200
15    0.001305    0.018505
16    0.001647    0.020153
17    0.002333    0.022486
18    0.002407    0.024893
19    0.003294    0.028187
20    0.006473    0.034659
21    0.025146    0.084951
22    0.039216    0.124167
23    0.047088    0.171255
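An alternative to scanning this alpha grid by hand is to cross-validate over ccp_alpha directly with GridSearchCV, using the pruning path's own alphas as candidates. The sketch below uses synthetic stand-in data (make_classification), since it is meant only to illustrate the pattern, not reproduce the notebook's results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the notebook's training data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)

# Candidate alphas come from the cost-complexity pruning path itself;
# clip at 0 to guard against tiny negative values from floating-point error,
# and drop the last alpha (it prunes the tree down to a single root node)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas[:-1].clip(min=0))

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"ccp_alpha": alphas},
    scoring="recall",  # same metric the notebook optimizes for
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

This folds the alpha selection into the same cross-validated recall criterion used for pre-pruning, rather than picking the alpha with the best test-set recall (which leaks information from the test set).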

#Plot the total impurity of the leaves against the effective alpha values
#extracted from the cost-complexity pruning path, to visualize the impact of
#different alpha values on the training set
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

#Next, train a decision tree for each effective alpha. This creates a series of
#decision trees, one per ccp_alpha value in the ccp_alphas array, fits them to
#the training data, then prints the number of nodes in the last tree and its
#corresponding ccp_alpha value
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)

Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766

#This line is plotting two graphs: one showing the relationship between the ccp_alpha values

#and the number of nodes in the decision tree, and the other showing the relationship

#between the ccp_alpha values and the depth of the decision tree.

clfs = clfs[:-1]

ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]

depth = [clf.tree_.max_depth for clf in clfs]

fig, ax = plt.subplots(2, 1, figsize=(10, 7))

ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")

ax[0].set_xlabel("alpha")

ax[0].set_ylabel("number of nodes")

ax[0].set_title("Number of nodes vs alpha")

ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")

ax[1].set_xlabel("alpha")

ax[1].set_ylabel("depth of tree")

ax[1].set_title("Depth vs alpha")

fig.tight_layout()

#Calculate the recall score on both the training and test datasets for each of
#the decision tree models trained above
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

#Plot the recall scores of the training and test datasets against different
#values of alpha to visualize how the recall metric varies with the complexity
#of the decision tree (controlled by alpha)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

#Identify and print the best decision tree model, the one that yields the
#highest recall score on the test data, from the list of models trained with
#different alpha values
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)

DecisionTreeClassifier(random_state=1)

Post-Pruning

#Define and train a Decision Tree classifier using a specific ccp_alpha value
#(0.04708834100596766) and class weights that emphasize the positive class
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=0.04708834100596766, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.04708834100596766,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)

#Check the performance metrics and confusion matrix of the estimator_2 decision
#tree model using the training data
model_performance_and_confusion_matrix(estimator_2, X_train, y_train)

   Accuracy    Recall  Precision        F1
0  0.836286  0.933535   0.359302  0.518892

#Calculate and store the performance metrics and confusion matrix of the
#estimator_2 decision tree on the training data in the
#decision_tree_tune_post_train variable, then display it
decision_tree_tune_post_train = model_performance_and_confusion_matrix(estimator_2, X_train, y_train)
decision_tree_tune_post_train

   Accuracy    Recall  Precision        F1
0  0.836286  0.933535   0.359302  0.518892
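The class_weight={0: 0.15, 1: 0.85} argument makes a missed loan purchaser roughly 5.7 times as costly as a false positive during training, which is what pushes recall up while precision drops, as the tables above show. A small sketch on synthetic imbalanced data (a stand-in for the bank's data, with the same hypothetical weights) illustrates the mechanism:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the notebook's data (~10% positives)
X, y = make_classification(
    n_samples=2000, weights=[0.9, 0.1], class_sep=0.5, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Fit a shallow tree with and without the class weights and compare recall
scores = {}
for name, weights in (("unweighted", None), ("weighted", {0: 0.15, 1: 0.85})):
    clf = DecisionTreeClassifier(
        max_depth=2, class_weight=weights, random_state=1
    ).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    scores[name] = (recall_score(y_te, pred), precision_score(y_te, pred))
print(scores)
```

With overlapping classes, the weighted tree typically trades precision for recall, which matches the campaign goal of not missing likely purchasers; the exact numbers depend on the synthetic data and are not meant to match the notebook's.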

Visualizing the Decision Tree

#Generate a visual representation of the estimator_2 decision tree model,
#including arrows indicating the splits
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

#Print a text-based representation of the estimator_2 decision tree model,
#including feature names and node weights
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))

|--- Income <= 98.50
| |--- weights: [392.70, 18.70] class: 0
|--- Income > 98.50
| |--- weights: [82.65, 262.65] class: 1

#Show the feature importances of the estimator_2 decision tree model, sorted in
#descending order of importance
print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

               Imp
Income         1.0
ZIPCode_94123  0.0
ZIPCode_94306  0.0
ZIPCode_94305  0.0
ZIPCode_94304  0.0
...            ...
ZIPCode_92069  0.0
ZIPCode_92068  0.0
ZIPCode_92064  0.0
ZIPCode_92056  0.0
Education_3    0.0

[477 rows x 1 columns]

#Create a horizontal bar plot to visualize the relative importance of features
#in the estimator_2 decision tree model, with features sorted by importance.
#The bars are colored green for better visualization.
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

#Evaluate the performance of the estimator_2 model on the test data and
#generate a confusion matrix along with the classification metrics
model_performance_and_confusion_matrix(estimator_2, X_test, y_test)

   Accuracy   Recall  Precision        F1
0  0.823333  0.90604   0.349741  0.504673

#Evaluate the performance of the estimator_2 model on the test data, store the
#confusion matrix and classification metrics, and display the results
decision_tree_tune_post_test = model_performance_and_confusion_matrix(estimator_2, X_test, y_test)
decision_tree_tune_post_test

   Accuracy   Recall  Precision        F1
0  0.823333  0.90604   0.349741  0.504673

Model Comparison and Final Model Selection

#Create a DataFrame to compare the training performance metrics of three
#decision tree models: one without pruning, one with pre-pruning, and one with
#post-pruning
models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_tune_perf_train.T, decision_tree_tune_post_train.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df

Training performance comparison:

           Decision Tree sklearn  Decision Tree (Pre-Pruning)  Decision Tree (Post-Pruning)
Accuracy                     1.0                          1.0                      0.836286
Recall                       1.0                          1.0                      0.933535
Precision                    1.0                          1.0                      0.359302
F1                           1.0                          1.0                      0.518892

#Compute the performance metrics of the two tuned decision tree models on the
#test data, then combine the results into a single DataFrame for comparison
perf_metrics_1 = model_performance_and_confusion_matrix(estimator, X_test, y_test)
perf_metrics_2 = model_performance_and_confusion_matrix(estimator_2, X_test, y_test)
perf_metrics_combined = pd.concat([perf_metrics_1, perf_metrics_2], axis=0)
perf_metrics_combined['Setup'] = ['Setup 1', 'Setup 2']
cols = ['Setup'] + [col for col in perf_metrics_combined if col != 'Setup']
perf_metrics_combined = perf_metrics_combined[cols]
print(perf_metrics_combined)

     Setup  Accuracy    Recall  Precision        F1
0  Setup 1  0.978667  0.785235   1.000000  0.879699
0  Setup 2  0.823333  0.906040   0.349741  0.504673

Actionable Insights and Business Recommendations

Questions:

1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?

The majority of mortgage values are clustered under $200,000, giving a right-skewed distribution with a long tail of high-value outliers.

2. How many customers have credit cards?

3,700 out of 5,000 customers own credit cards.

3. What are the attributes that have a strong correlation with the target attribute (personal loan)?

The characteristics with the highest correlation to personal loan acceptance are:

Credit card ownership, with a correlation coefficient of 0.45

Family size, with a correlation coefficient of 0.42

Holding a securities account, with a correlation coefficient of 0.41

4. How does a customer's interest in purchasing a loan vary with their age?

There is a decline in the inclination to pursue a loan as age increases. The age group displaying the greatest willingness to accept loans is between 30 and 50 years old.

5. How does a customer's interest in purchasing a loan vary with their education?

Loan acceptance varies noticeably with education level. Customers holding graduate degrees are the most likely to accept loans, with an acceptance rate of 15.8%. This is followed by those with undergraduate degrees at a 9.9% acceptance rate, and individuals with advanced or professional degrees, with a 7.3% acceptance rate.

What recommendations would you suggest to the bank?

Create customized loan products tailored to individuals holding graduate or professional degrees, as they are likely to have a higher conversion rate. Additionally, target existing customers who hold securities and CD accounts with personalized personal loan offerings through cross-selling initiatives.

Develop marketing campaigns aimed specifically at younger individuals with higher income levels, with a strong emphasis on crafting specialized financial products that align with their financial goals and ambitions. Income and age appear to be more influential factors than the other variables considered.

Use the pruned decision tree model, which was tuned for strong recall, to pinpoint and focus on the customer profiles with the most potential as part of the targeting strategy.
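In code, the targeting strategy above amounts to scoring each liability customer with the chosen model and ranking by the predicted probability of accepting a loan. The sketch below uses synthetic stand-in data and hypothetical names (final_model, prospects) rather than the notebook's actual objects:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the notebook's final model and the customer feature table
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)
final_model = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)
prospects = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# Score every customer with the probability of the positive class (loan acceptance)
prospects["loan_probability"] = final_model.predict_proba(X)[:, 1]

# The campaign target list: the 100 customers most likely to accept
target_list = prospects.sort_values("loan_probability", ascending=False).head(100)
print(target_list["loan_probability"].describe())
```

Ranking by probability, rather than by the hard 0/1 prediction, lets the marketing team tune the campaign size to budget: contact the top N customers instead of everyone the model labels positive.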
